Key topics: Time Series Analysis, Forecasting, Spatial Data, R Shiny Dashboard

This report contains an exploratory data analysis of cyclist violations in NYC from 2018 to June 30, 2023, as well as a forecast of cyclist counts to 2025. Everything was coded in R Markdown files, which can be viewed in the GitHub repo. A supplemental R Shiny dashboard contains selectable maps and time frames for all violations.

Introduction

While one can find a good deal of information about general cycling in NYC, it is more difficult to find information specifically about bike violations (also referred to as ‘tickets’ or ‘summonses’). This report examines some general facts obtained from available data about cycling violations in NYC, such as the most common violations, changes in violations over time, and differences between boroughs. It also examines cyclist count data, looking at historic seasonal trends and forecasting future ridership.

Throughout this report, I will use the term ‘cyclist’ to refer to regular bike riders, e-bike riders, and e-scooter riders, and ‘bicycle’ to refer to any of the vehicles that fall under these labels.

Data

Datasets

The datasets used were all taken from the NYC Open Data website, except the violation code description data, which was taken from the NY DMV website. The violation data was merged from two different datasets: a ‘year to date’ dataset spanning Jan 1, 2023 to June 30, 2023, and a ‘historical’ dataset spanning Jan 1, 2018 to December 31, 2022. These datasets contained all violations issued by the NYPD, but only the violations labeled ‘Bike’, ‘Ebike’, and ‘Escooter’ were kept for this report. The bicycle count data was aggregated per day and per week, and these totals were appended to the merged violations dataset. The textual violation code descriptions were also merged with the violations dataset, so that a description of each violation code could easily be accessed.

Bike Counters

mapview(na.omit(df_bicycle_counters_boroughs), xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)

Bicycle counter locations throughout the 5 boroughs. Note that two counters on Amsterdam Ave on the west side of Central Park have the exact same latitude and longitude in the .csv file, but in reality one is on Amsterdam Ave counting bike traffic in the uptown direction and the other is an avenue over on Columbus Ave, counting bike traffic flowing downtown.

The bicycle count data was taken from 18 separate bike counters spread throughout the 5 boroughs, as can be seen in the map above. Although a total of 29 were listed in the .csv file, some of those were either duplicates or not applicable (for example, they only counted pedestrians) and were removed from consideration. The majority of these counters could be considered ‘Manhattan-centric.’ For instance, there is only one counter in Staten Island (by the ferry access to Manhattan), only one in the Bronx (in the south Bronx), and only three in Queens, with one of those placed near the Queensboro Bridge to Manhattan. Because of this unequal representation, I chose not to perform any borough-specific analyses involving rider count data, and any generalizations or takeaways from this report should keep this limitation in mind. The violation data does not depend on the bike counters, so borough-level violation analyses are unaffected. Each counter counts the cyclists that cross it and records the total every 15 minutes. The counter data ranges from 2012 to the present, but only the data from 2018 onward were used.
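To make the aggregation step concrete, here is a minimal sketch of how 15-minute counter readings can be rolled up into daily totals. It uses base R and a toy data frame; the column names (`counter_id`, `timestamp`, `counts`) are assumptions for illustration and may not match the actual NYC Open Data fields or the code in the repo.

```r
# Toy 15-minute counter readings. All column names and values here are
# illustrative assumptions, not the actual NYC Open Data fields.
readings <- data.frame(
  counter_id = c(1, 1, 1, 2, 2, 2),
  timestamp  = as.POSIXct(
    c("2022-06-01 08:00:00", "2022-06-01 08:15:00", "2022-06-02 08:00:00",
      "2022-06-01 08:00:00", "2022-06-01 08:15:00", "2022-06-02 08:00:00"),
    tz = "UTC"
  ),
  counts     = c(12, 15, 9, 30, 28, 22)
)

# Collapse each timestamp to its calendar date
readings$date <- as.Date(readings$timestamp, tz = "UTC")

# Sum over all counters and all 15-minute bins within each day
daily_totals <- aggregate(counts ~ date, data = readings, FUN = sum)
names(daily_totals)[2] <- "daily_total_cyclists"

print(daily_totals)
```

The same pattern, grouped by week instead of by date, would give the weekly totals.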

Removal of NA values

Of the 126,812 initial violation entries in the specified time frame, 111 were removed, leaving 126,701. Sixteen rows did not contain violation codes, 92 rows did not contain city name or location information, and 3 rows did not contain location information.
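A removal step like this can be sketched in base R with `complete.cases()`. The data frame below is a toy stand-in, and the set of required columns is an assumption about which fields were checked.

```r
# Toy violations table; column names and values are illustrative assumptions
violations <- data.frame(
  violation_code = c("1111D1C", NA, "1236B", "412P1"),
  city_nm        = c("MANHATTAN", "BROOKLYN", NA, "QUEENS"),
  latitude       = c(40.77, 40.68, 40.71, NA),
  longitude      = c(-73.98, -73.97, -73.95, NA),
  stringsAsFactors = FALSE
)

# Columns that must be present for a row to be kept (assumed list)
required <- c("violation_code", "city_nm", "latitude", "longitude")

# Missing values per column, reported before anything is dropped
print(colSums(is.na(violations[, required])))

# Keep only rows with no missing value in any required column
keep <- complete.cases(violations[, required])
violations_clean <- violations[keep, ]

cat("Removed", sum(!keep), "of", nrow(violations), "rows\n")
```

Reporting the per-column counts first makes it easy to state how many rows were lost for each reason.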

EDA: Cyclist Counts

Daily Total Cyclists

ts_daily_total |>
  ggplot(aes(x = daily_total_cyclists)) +
  geom_histogram(bins = 80) +
  labs(title='Daily Total Cyclists Histogram') +
  xlab('Cyclists Per Day') +
  ylab('Count')

df_bike_violations |>
  mutate(violation_date = date(violation_date)) |>
  arrange(desc(daily_total_cyclists)) |>
  distinct(violation_date, daily_total_cyclists) |>
  slice(1:10) |>
  kable(caption = "Top 10 Busiest Cycling Days", 
        col.names = linebreak(c('Violation Date', 'Daily Total Cyclists'), align = "l")) |> 
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Top 10 Busiest Cycling Days

Violation Date    Daily Total Cyclists
2019-10-30        59750
2019-11-04        59291
2019-11-05        58329
2020-09-12        57637
2019-11-27        57456
2019-07-16        57394
2019-11-06        57384
2019-11-26        57323
2020-06-13        57195
2020-11-07        56453

The number of cyclists counted per day ranged from 1,665 to almost 60,000. Seven of the ten busiest days fell in late October or November (six in 2019 and one in 2020), with two in June and July and one in September. Further investigation could help uncover why this is, but my guess is that people might be trying to squeeze in one last ride before winter.

Daily Total Cyclists over Time

ts_daily_total |>
  autoplot(daily_total_cyclists) +
  geom_smooth(method = "lm") +
  ggtitle('Daily Total Cyclists Time Plot') +
  xlab('Violation Date') +
  ylab('Daily Total Cyclists')

The plot of daily total cyclists shows a general increasing trend as well as strong seasonality: there are more cyclists in the warmer summer months than in the colder months. The slope of the regression line is ~6.54, meaning the daily total of counted cyclists grew by about 6.5 per day on average, or roughly 2,400 more counted cyclists per day with each passing year.
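To show how a slope like this is read, here is a small base R sketch that fits a straight line to a synthetic daily series with a known trend. The data are made up for illustration (a trend of 6.5 per day plus a yearly cycle); they are not the report's data, and the actual fit in the report comes from `geom_smooth(method = "lm")`.

```r
# Synthetic daily series: a known linear trend plus a yearly cycle that is
# lowest in January. All numbers are illustrative assumptions.
days  <- 0:(3 * 365 - 1)
dates <- as.Date("2018-01-01") + days
true_slope <- 6.5
counts <- 20000 + true_slope * days - 10000 * cos(2 * pi * days / 365)

# Regress the daily count on the date; numeric dates are measured in days,
# so the slope is "change in daily count per calendar day"
fit <- lm(counts ~ as.numeric(dates))
slope_per_day <- unname(coef(fit)[2])

cat("Estimated change in daily count per day:", round(slope_per_day, 2), "\n")
cat("Implied change over one year:", round(slope_per_day * 365), "\n")
```

The slope is a change in the daily count, not a count of distinct new riders. Note also that a seasonal cycle that does not cover whole years can pull a straight-line fit up or down, so the estimate depends somewhat on where the series starts and ends.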

Seasonality

Let’s further explore the seasonality. It’s easier to view this with monthly totals:

ts_monthly_total |>
  gg_season(monthly_total_cyclists) +
  ggtitle('Seasonal plot of total monthly ridership') +
  xlab('Violation Month') +
  ylab('Monthly Total Cyclists')

ts_monthly_total |>
  gg_subseries(monthly_total_cyclists) +
  ggtitle('Subseries plot of total monthly ridership') +
  xlab('Violation Month') +
  ylab('Monthly Total Cyclists')

The increase in ridership in the warmer months can clearly be seen. Also note the yearly increase, as well as the dip in April of 2020 during covid lockdown.

Busiest Bike Counters

# Note: Count values were copied and pasted from 2_NYPD_Bike_viol_EDA.rmd.  If that file is
# updated, make sure to update these values.

location <- c('Williamsburg side of Williamsburg bridge', 
              'Queens side of Queensboro Bridge', 
              'Manhattan side of Manhattan bridge', 
              'Kent ave in Williamsburg', 
              'Brooklyn Bridge')
count_2022 <- c(1964902, 1818163, 1584788, 1066870, 1002070)

df_busiest_counters <- data.frame(location, count_2022)

df_busiest_counters |> 
  kable(caption = "Busiest Bike Counters, 2022", 
        col.names = linebreak(c('Counter Location', 'Total Counted Cyclists, 2022'), align = "l")
        ) |> 
  kable_classic(full_width = F, html_font = "Cambria", position = "left") |> 
  row_spec(1, bold=T)

Busiest Bike Counters, 2022

Counter Location                          Total Counted Cyclists, 2022
Williamsburg side of Williamsburg bridge  1964902
Queens side of Queensboro Bridge          1818163
Manhattan side of Manhattan bridge        1584788
Kent ave in Williamsburg                  1066870
Brooklyn Bridge                           1002070

The busiest bike counters in 2022 were at bridge access points as well as Kent Avenue in Williamsburg, which feeds onto the Williamsburg bridge. For 2023 (through June 30th), the order is mostly the same, with Kent Avenue and Brooklyn Bridge switching places. An analysis of daily total count data from just the Williamsburg bridge counter revealed similar seasonal trends to the total count data from all counters.

EDA: Violations

Violations Over Time

ts_daily_total |>
  autoplot(daily_total_violations) +
  ggtitle("Total Daily Violations") +
  xlab('Violation Date') +
  ylab('Violation Count')

There were 126,701 total violations, spanning 188 different violation types, handed out to cyclists during the time frame covered by this data. As the Total Daily Violations plot shows, there was a sharp drop-off in daily violations after the covid lockdown (April 2020). The data does not explain this drop, and more information would be needed to determine the cause. Some seasonality is present as well, although it is not as pronounced as in the count data. Fewer violations were handed out around the Christmas and New Year’s holidays, while more were handed out in the late summer and early autumn.

Violations Per Borough

df_bike_violations |> 
  mutate(year = year(violation_date)) |> 
  ggplot(aes(x=year, fill=city_nm)) +
  geom_bar() +
  labs(title="Yearly Cyclist Violations", 
       subtitle = 'Through June 30, 2023',
       fill="Borough") +
  scale_fill_brewer() +
  xlab('Year') +
  ylab('Violations')

Per borough, we can see that Manhattan had the most violations, with Brooklyn coming in second. More information is required to determine why. It is possible that the police presence in Manhattan is higher. It is also possible that there are far more cyclists in Manhattan than in the other boroughs, although this cannot be determined from the current data because of the disproportionate bike counter placement. We can also see that the bump in violations from 2021 to 2022 was due to more violations being handed out in Brooklyn: Brooklyn increased from 2021 to 2022, while Manhattan decreased.

Most Common Violations

df_bike_violations |>
  # n() already returns the row count per group, so no sum() is needed
  summarise(total = n(), .by = c(violation_code, description)) |>
  arrange(desc(total)) |>
  # percent of all violations, computed before slicing to the top 10
  mutate(percent = round(100 * total / sum(total), 1)) |>
  slice(1:10) |>
  kable(caption = 'Top 10 Violations', 
          col.names = linebreak(c('Violation Code', 'Description', 'Total', 'Percent'))) |> 
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Top 10 Violations

Violation Code  Description                                                      Total  Percent
1111D1C         BICYCLE OR SKATEBOARD FAILED TO STOP AT RED LIGHT- NYC           55933  44.1
1110AB          DISOBEYED TRAFFIC DEVICE WHILE OPERATING BICYCLE                 17166  13.5
1127AB          DRIVING WRONG DIRECTION ON ONE-WAY STREET - BICYCLE              7995   6.3
403A3IX         BICYCLE FAILED TO YIELD TO VEHICLE/PEDESTRIAN AT RED LIGHT- NYC  6682   5.3
37524AB         OPER BICYCLE WITH MORE 1 EARPHONE                                6052   4.8
1236B           NO BELL OR SIGNAL DEVICE ON BICYCLE                              3959   3.1
412P1           BIKING OFF LANE- NYC                                             3673   2.9
407C31          BIKE/SKATE ON SIDEWALK-NYC                                       3459   2.7
1232A           IMPROPER OPERATION OF BICYCLE                                    2449   1.9
1111D1N         NYC REDLIGHT                                                     1925   1.5

From the Top 10 Violations table, we can see that improper behavior at red lights (1111D1C and 403A3IX) accounts for ~50% of all cyclist violations. Other common violations include disobeying a traffic device, riding in the wrong direction on a one-way street, operating a bicycle with more than one earphone, not having a bell, and riding on the sidewalk.
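The ~50% figure can be checked directly from the table. The sketch below uses the totals from the Top 10 Violations table and the 126,701 total violations reported earlier. Treating 1111D1N ("NYC REDLIGHT") as a third red-light code is my assumption based on its description.

```r
# Totals copied from the Top 10 Violations table
total_violations <- 126701
red_light <- c("1111D1C" = 55933, "403A3IX" = 6682, "1111D1N" = 1925)

# Share of the two codes named in the text
share_two <- 100 * sum(red_light[c("1111D1C", "403A3IX")]) / total_violations

# Share when 1111D1N is also counted as a red-light violation (assumption)
share_three <- 100 * sum(red_light) / total_violations

print(round(c(two_codes = share_two, three_codes = share_three), 1))
# two_codes = 49.4, three_codes = 50.9
```

Either way the share rounds to about half of all violations.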

# Map the most common violation (1111D1C) from June 2022 onward
df_bike_violations |>
  filter(date(violation_date) >= as.Date("2022-06-01"),
         violation_code == "1111D1C") |>
  mapview(xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE, alpha.regions = 0.2)

Looking at roughly one year of only the most common violation (1111D1C, from June 2022 through June 2023), we can see a few cluster areas emerge:

  • Upper West and Upper East sides of Manhattan (and their corresponding avenues up through Harlem)
  • Up and down 1st and 2nd Avenues, especially in the East Village
  • On the Brooklyn side of the Williamsburg Bridge
  • Up and down 4th and 5th Avenues in Brooklyn
  • Liberty Ave in Ozone Park/Richmond Hill
  • The East side of Prospect Park on Bedford Avenue
  • Bensonhurst, Brooklyn

Notable Violation

One violation stands out: Code 12332, ATTACHING SELF TO MOVING MOTOR VEHICLE, was issued only once, on May 1, 2018 at 15:07:17. It is highly notable, as it is direct proof that at least one famous time traveller graced us with his presence near West 57th St and 9th Ave:

df_bike_violations |> 
  filter(violation_code == "12332") |> 
  mapview(xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)

Busiest Violation Days

df_bike_violations |>
  mutate(violation_date = date(violation_date)) |>
  group_by(violation_date) |> 
  summarise(daily_total_violations = n()) |> 
  arrange(desc(daily_total_violations)) |>
  slice(1:10) |>
  kable(caption = "Top 10 busiest violation days", 
        col.names = linebreak(c('Violation Date', 'Daily Total Violations'))) |> 
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

Top 10 busiest violation days

Violation Date    Daily Total Violations
2019-08-15        324
2019-08-09        312
2019-10-02        308
2018-07-31        306
2018-06-19        302
2018-05-03        297
2019-10-01        291
2019-04-25        290
2018-08-29        284
2018-04-12        275

All ten of the busiest violation days fell between April and early October of 2018 and 2019, half of them in June through August. This is consistent with the sharp drop-off in violations after the covid lockdown, and with both ridership and violations being higher in the warmer months than in the colder months.

Relationship between Violations and Ridership

Are more violations handed out to cyclists on days of higher ridership? In other words, as ridership increases, do violations also increase? Over the full period, the correlation coefficient between the two quantities was 0.05, indicating essentially no linear relationship. However, splitting the data at the covid lockdown gave a correlation coefficient of 0.45 before the lockdown and 0.35 after it, indicating a moderate positive relationship within each period. The pooled figure is likely so low because the post-lockdown period combines higher ridership with far fewer violations, which masks the within-period relationship. There is still a lot of unaccounted-for variation, which tells us that more factors are involved in this relationship.
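The gap between the pooled and per-period correlations can be reproduced on synthetic data. The sketch below, in base R, builds two hypothetical periods in which violations rise with ridership, but the second period has higher ridership and a much lower violation baseline. All numbers are invented for illustration and are not the report's data.

```r
set.seed(1)
n <- 200

# Hypothetical "pre" period: lower ridership, higher violation baseline
pre_riders  <- rnorm(n, mean = 30000, sd = 8000)
pre_viol    <- 100 + 0.004 * pre_riders + rnorm(n, sd = 40)

# Hypothetical "post" period: higher ridership, lower violation baseline
post_riders <- rnorm(n, mean = 36000, sd = 8000)
post_viol   <- -50 + 0.004 * post_riders + rnorm(n, sd = 40)

df <- data.frame(
  period     = rep(c("pre", "post"), each = n),
  riders     = c(pre_riders, post_riders),
  violations = c(pre_viol, post_viol)
)

# Correlation over all days, ignoring the period
r_pooled <- cor(df$riders, df$violations)

# Correlation computed separately within each period
r_by_period <- sapply(split(df, df$period),
                      function(d) cor(d$riders, d$violations))

print(round(c(pooled = r_pooled, r_by_period), 2))
```

The shift in level between the two periods pulls the pooled correlation toward zero even though the relationship within each period is clearly positive.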

Time Series Forecasting: Rider Counts

Since the seasonality for rider counts is yearly, I used monthly total ridership instead of daily total ridership for modeling. Daily total ridership proved to be too noisy.

The training set consisted of data from the beginning of the data set, Jan 2018, through May 2022. The test set contained the data from June 2022 to June 2023.

Five different models for monthly bike counts were attempted: a seasonal naive model, three exponential smoothing models (ETS AAA, ETS AAdA, and the ETS function’s own choice, ETS MNM), and the ARIMA function’s choice (ARIMA (1,0,0)(1,1,0)[12]).

Since there was strong seasonality, I decided to focus on exponential smoothing models. An ARIMA model and a seasonal naive model were fit for comparison. When the ETS function was allowed to choose what it considered the best model, it chose a multiplicative ETS model (MNM). The function selects a model by minimizing an information criterion (AICc by default) on the training data, which does not guarantee the lowest error on held-out data, and here MNM did worse on the test set than ETS AAA. This highlights the importance of understanding analysis techniques when choosing the best model, and not leaving the decision up to an algorithm.

# These values were copied and pasted from 3_NYPD_Bike_viol_Forecasting.RMD.  If that file is updated, make sure to double-check these values:

tests <- c('Seasonal Naive', 'ETS_AAA', 'ETS_AAdA', 'ETS_MNM',  'ARIMA (1,0,0)(1,1,0)[12]')
RMSE <- c('139936.1', '86248.05', '94780.86', '116723.8', '114677.4')

df_forecasting <- data.frame(tests, RMSE)

df_forecasting |> 
  kable(caption = "Forecasts", 
        col.names = linebreak(c('Forecasting Method', 'RMSE on Test Set'))) |> 
  kable_classic(full_width = F, html_font = "Cambria", position = "left") |> 
  row_spec(2, bold=T)

Forecasts

Forecasting Method        RMSE on Test Set
Seasonal Naive            139936.1
ETS_AAA                   86248.05
ETS_AAdA                  94780.86
ETS_MNM                   116723.8
ARIMA (1,0,0)(1,1,0)[12]  114677.4

From the Forecasts table, we can see that ETS AAA (exponential smoothing with additive errors, additive trend, and additive seasonality), also known as Holt-Winters’ additive method, forecast the test set best when evaluated on RMSE. This is consistent with the trends that we noticed earlier. The magnitude of the seasonality remains fairly constant, which calls for additive seasonality in our model. If the seasonality grew in magnitude with time (for instance, if the difference between winter and summer ridership grew each year), then multiplicative seasonality might have performed better. The trend also appears fairly constant (and linear), which calls for an additive trend in our model.
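For readers who want to see how a test-set RMSE comparison like this is computed, here is a minimal base R sketch. It is not the report's code: it uses the built-in `co2` series as a stand-in for the ridership data, and `stats::HoltWinters()` as a stand-in for fable's `ETS()` with additive trend and seasonality. The `rmse()` helper is a hypothetical name introduced here.

```r
# Root mean squared error between observed and forecast values
# (hypothetical helper, not part of the report's code)
rmse <- function(actual, forecast) {
  sqrt(mean((as.numeric(actual) - as.numeric(forecast))^2))
}

# Built-in monthly CO2 series stands in for the ridership data (assumption
# made only so that the example runs without the report's files)
train <- window(co2, end = c(1996, 12))
test  <- window(co2, start = c(1997, 1))
h <- length(test)

# Seasonal naive forecast: repeat the last observed year of the training set
snaive_fc <- tail(as.numeric(train), 12)[((seq_len(h) - 1) %% 12) + 1]

# Additive Holt-Winters via stats::HoltWinters (a stand-in for fable's ETS)
hw_fit <- HoltWinters(train, seasonal = "additive")
hw_fc  <- predict(hw_fit, n.ahead = h)

print(c(seasonal_naive = rmse(test, snaive_fc),
        holt_winters   = rmse(test, hw_fc)))
```

RMSE is in the same units as the series, so the values in the Forecasts table are errors in cyclists per month.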

Model Diagnostics

# separate train and test sets
set.seed(99)
train_monthly <- ts_monthly_total |> select(monthly_total_cyclists) |> filter_index(~"2022-05")
test_monthly <- ts_monthly_total |> select(monthly_total_cyclists) |> filter_index("2022-06"~.)

# model train set
mod_ETS_AAA <- train_monthly |>
  model(ETS(monthly_total_cyclists ~ error("A") + trend("A") + season("A")))

# plot residuals
gg_tsresiduals(mod_ETS_AAA) +
  ggtitle('ETS AAA Residuals')

The residuals for this model do not show anything out of the ordinary. The Box-Pierce (p = 0.3329984) and Ljung-Box (p = 0.1743495) tests both fail to reject the null hypothesis of no autocorrelation, so the residuals are consistent with white noise. The Shapiro-Wilk test (p = 0.1783) also fails to reject the null, so there is no evidence that the residuals depart from a normal distribution.
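These three tests are available in base R, and the sketch below shows the calls on stand-in residuals. The residual vector is simulated white noise of length 53 (the number of training months), and the lag of 24 is an assumption; the lag and degrees-of-freedom settings used for the p-values above are not shown in this report.

```r
set.seed(99)

# Stand-in residuals: 53 simulated values, one per training month
# (Jan 2018 through May 2022). Real residuals would come from the fitted model.
res <- rnorm(53)

# Assumed lag (two seasonal periods of monthly data)
lag_used <- 24

# Portmanteau tests: the null hypothesis is no autocorrelation up to lag_used
bp <- Box.test(res, lag = lag_used, type = "Box-Pierce")
lb <- Box.test(res, lag = lag_used, type = "Ljung-Box")

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(res)

print(c(box_pierce   = bp$p.value,
        ljung_box    = lb$p.value,
        shapiro_wilk = sw$p.value))
```

A p-value above the chosen significance level (commonly 0.05) means the test fails to reject its null hypothesis. When testing residuals from a fitted model, the `fitdf` argument of `Box.test()` can be used to adjust the degrees of freedom for the estimated parameters.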

Forecast Plot to 2025

Lastly, here is a plot using our ETS_AAA model to predict ridership into 2025, with 80% and 95% prediction intervals shown:

# Forecast 36 months past the end of the training set (June 2022 through May 2025)
fct_ETS_AAA <- train_monthly |> 
  model(ETS(monthly_total_cyclists ~ error("A") + trend("A") + season("A"))) |> 
  forecast(h=36)

ts_monthly_total |> 
  autoplot(monthly_total_cyclists) +
  autolayer(fct_ETS_AAA, size=0.7) +
  autolayer(test_monthly) +
  ggtitle('ETS AAA Forecast of Total Monthly Cyclists to 2025') +
  xlab('Month') +
  ylab('Monthly Total Cyclists')

Conclusion and Important Takeaways

  • While ridership has been steadily increasing, with the daily count growing by about 6.5 cyclists per day on average, violations dropped noticeably after the covid lockdown: more violations were handed out before the lockdown than after, despite the increase in riders.

  • Bridge access is where bike counters count the most riders, highlighting the importance of these transportation access points.

  • The Williamsburg Bridge was the busiest bike route with a bike counter in 2022.

  • The busiest cycling days have mostly been in the late autumn, while the busiest violation days all fell between April and early October of 2018 and 2019, before the covid lockdown.

  • Seasonally, fewer violations are handed out around Christmas and New Year’s.

  • Manhattan has the most violations handed out, by far, with Brooklyn coming in second.

  • Red-light violations were the most common, making up about 50% of all violations. Others include ‘disobeying traffic device’, riding the wrong way, having more than one earphone in, not having a bell, and riding on the sidewalk.

With bike ridership forecast to continue growing, a clear understanding of bike violations will become more important. To facilitate this, more bike counters placed in the boroughs outside Manhattan would give more representative bike counts. Additionally, further spatial analysis could help identify and predict cluster areas for different violations.